Abstract

Recently several authors have proposed stochastic evolutionary models for the growth 
of complex networks that give rise to power-law distributions. These models are based 
on the notion of preferential attachment leading to the "rich get richer" phenomenon. 
Despite the generality of the proposed stochastic models, there are still some unexplained 
phenomena, which may arise due to the limited size of networks such as protein and e- 
mail networks. Such networks may in fact exhibit an exponential cutoff in the power-law 
scaling, although this cutoff may only be observable in the tail of the distribution for 
extremely large networks. We propose a modification of the basic stochastic evolutionary 
model, so that after a node is chosen preferentially, say according to the number of its 
inlinks, there is a small probability that this node will be discarded. We show that as a 
result of this modification, by viewing the stochastic process in terms of an urn transfer 
model, we obtain a power-law distribution with an exponential cutoff. Unlike many other 
models, the current model can capture instances where the exponent of the distribution 
is less than or equal to two. As a proof of concept, we demonstrate the consistency of 
our model by analysing a yeast protein interaction network, the distribution of which is 
known to follow a power law with an exponential cutoff. 

1 Introduction 

Power-law distributions taking the form 

\[ f(i) = C\, i^{-r}, \qquad (1) \]

where C and r are positive constants, are abundant in nature [Sor00]. The constant r is called
the exponent of the distribution. Examples of such distributions are: Zipf's law, which states 
that the relative frequency of words in a text is inversely proportional to their rank, Pareto 's 
law, which states that the number of people whose personal income is above a certain level 
follows a power-law distribution with an exponent between 1.5 and 2 (Pareto's law is also 
known as the 80:20 law, stating that about 20% of the population earn 80% of the income) and 
Gutenberg-Richter's law, which states that, over a period of time, the number of earthquakes 
of a certain magnitude is roughly inversely proportional to the magnitude. Recently, several 
researchers have detected power-law distributions in the topology of various networks such as 
the World-Wide-Web [BKM+00, KRR+00] and author citation graphs [Red98].

The motivation for the current research is two-fold. First, from a complex network perspective,
we would like to understand the stochastic mechanisms that govern the growth of a
network. This has led to fruitful interdisciplinary research by a mixture of Computer Scientists,
Mathematicians, Statisticians, Physicists, and Social Scientists [AB02, DM00, KRL00,
New01, PFL+02], who are actively involved in investigating various characteristics of complex
networks, such as the degree distribution of the nodes, the diameter, and the relative sizes of 
various components. These researchers have proposed several stochastic models for the evo- 
lution of complex networks; all of these have the common theme of preferential attachment 
— which results in the "rich get richer" phenomenon — for example, where new links to ex- 
isting nodes are added in proportion to the number of links to these nodes currently present. 
Considering the web as an example of a complex network, one of the challenges in this line of 
research is to explain the empirically discovered power-law distributions [AH01]. It turns out
that the evolutionary model of preferential attachment fails to explain several of the empirical
results, because the exponents it predicts are inconsistent with the observations.
To address this problem, we proposed in [LFLW02] an extension of the stochastic model for
the web's evolution in which the addition of links utilises a mixture of preferential and non- 
preferential mechanisms. We introduced a general stochastic model involving the transfer of 
balls between urns that also naturally models quantities such as the numbers of web pages in 
and visitors to a web site, which are not naturally described in graph-theoretic terms. 

Another extension of the preferential attachment model, proposed in [DM00], takes into
account the ageing of nodes, so that a link is connected to an old node, not only preferentially,
but also depending on the age of the node: the older the node is, the less likely it is that other
nodes will be connected to it. It was shown in [DM00] that if the ageing function is a power
law then the degree distribution has a phase transition from a power-law distribution, when
the exponent of the ageing function is less than one, to an exponential distribution, when the
exponent is greater than one. A different model of node ageing was proposed in [ASBS00] with
two alternative ageing functions. With the first function the time a node remains 'active', i.e. 
may acquire new links, decays exponentially, and with the second function a node remains 
active until it has acquired a maximum number of links. Both functions were shown by 
simulation to lead to an exponential cutoff in the degree distribution, and for strong enough 
constraints the distribution appeared to be purely exponential. Another explanation for 
a cutoff, offered in [MBSA02], is that when a link is created the author of the link has
limited information processing capabilities and thus only considers linking to a fraction of 
the existing nodes, those that appear to be "interesting". It was shown by simulation that 
when the fraction of "interesting nodes" is less than one there is a change from a power-law 
distribution to one that exhibits an exponential cutoff, leading eventually to an exponential 
distribution when the fraction is much less than one. 

A second motivation for this research is that the viability and efficiency of network algorithmics
are affected by the statistical distributions that govern the network's structure. For
example, the discovered power-law distributions in the web have recently found applications
in local search strategies in web graphs [ALPH01], compression of web graphs [AM01] and an
analysis of the robustness of networks against error and attack [AJB00, JMBO01].

Despite the generality of the proposed stochastic models for the evolution of complex 
networks, there are still some unexplained phenomena; these may arise due to the limited size 
of networks such as protein, e-mail, actor and collaboration networks. Such networks may in 
fact exhibit an exponential cutoff in the power-law scaling, although this cutoff may only be
observable in the tail of the distribution for extremely large networks. The exponential cutoff
is of the form

\[ f(i) = C\, i^{-r} q^{i}, \qquad (2) \]

with 0 < q < 1. The exponent r in (2) will be smaller than the exponent that would be
obtained if we tried to fit a power law without a cutoff, like (1), to the data. Unlike many
other models leading to power-law distributions, models with a cutoff can capture situations 
in which the exponent of the distribution is less than or equal to two, which would otherwise 
have infinite expectation. 

An exponential cutoff has been observed in protein networks [JMBO01], in e-mail networks
[EMB02], in actor networks [ASBS00], in collaboration networks [New01, Gro02], and is
apparently also present in the distribution of inlinks in the web graph [MBSA02], where a
cutoff had not previously been observed. We believe it is likely, in many such cases where
power-law distributions have been observed, that better models would be obtained with an
exponential cutoff like (2), with q very close to one.

The main aim of this paper is to provide a stochastic evolutionary model that results 
in asymptotic power-law distributions with an exponential cutoff, thus allowing us to model 
discrete finite systems more accurately and, in addition, enabling us to explain phenomena 
where the exponent is less than or equal to two. As with many of these stochastic growth 
models, the ideas originated from Simon's visionary paper published in 1955 [Sim55]. At the
very beginning of his paper, in equation (1.1), Simon observed that the class of distribution
functions he was about to analyse can be approximated by a distribution like (2); he called
the term $q^i$ the convergence factor and suggested that q is close to one. He then went on
to present his well-known model that yields power-law distributions like (1), and which has
provided the basis for the models rediscovered over 40 years later. Simon gave no explanation 
for the appearance, in practice, of the convergence factor. 

Considering, for example, the web graph, the modification we make to the basic model to 
explain the convergence factor is that after a web page is chosen preferentially, say according 
to the number of its inlinks, there is a small probability that this page will be discarded. 
A possible reason for this may be that the web page has acquired too many inlinks and 
therefore needs to be redesigned, or simply that an error has occurred and the page is lost. 
Other examples are e-mail networks, where new users join and old users leave the network, 
and protein networks, where proteins may appear or disappear from the network over time. 

Networks with an exponential cutoff fall into two categories. The first category of network, 
which includes actor and collaboration networks, is monotonically increasing, i.e. nodes and 
links are never removed from such networks. In this category nodes can be either active, 
in which case they can be the source or destinations of new links, or inactive in which case 
they are not involved in any new links from the time they first become inactive. The second 
category of network, which includes the web graph, e-mail and protein networks, is non- 
monotonic, i.e. links and nodes may be removed. In this paper we consider the second 
category of network, but only allow node deletion. (In [FLL05] we considered the case where
only link removal is allowed and showed that in that case the degree distribution follows a 
power law.) The first category of network (which also exhibits an exponential cutoff in the 
degree distribution) will be dealt with in a follow-up paper. 

The rest of the paper is organised as follows. In Section 2 we present an urn transfer
model that extends Simon's model by allowing a ball to sometimes be discarded. In Section 3
we derive the steady state distribution of the model, which, as stated, follows an asymptotic
power law with an exponential cutoff like (2). In Section 4 we demonstrate that our model
can provide an explanation of the empirical distributions found in protein networks. Finally,
in Section 5 we give our concluding remarks.

2 An Urn Transfer Model 

We now present an urn transfer model [JK77] for a stochastic process that emulates the
situation when balls (which might represent, for example, proteins or e-mail accounts) are
discarded with a small probability. This model can be viewed as an extension of Simon's
model [Sim55], where either a ball is added to the first urn with probability p, or some ball
is moved along from the urn it is in to the next urn with probability 1 - p. We assume
that a ball in the ith urn has i pins attached to it (which might represent, for example,
interactions between proteins or e-mail messages between e-mail accounts). We note that
there is a correspondence between the Barabasi and Albert model [BA99], defined in terms
of nodes and links, and Simon's model, defined in terms of balls and pins, as was established
in [BE01]. Essentially, the correspondence is obtained by noting that the balls in an urn can
be viewed as an equivalence class of nodes all having the same connectivity (i.e. degree).

We assume a countable number of urns, $urn_1, urn_2, urn_3, \ldots$. Initially all the urns are
empty except $urn_1$, which has one ball in it. Let $F_i(k)$ be the number of balls in $urn_i$ at
stage k of the stochastic process, so $F_1(1) = 1$ and all other $F_i(1) = 0$. Then, at stage k + 1
of the stochastic process, where $k \ge 1$, one of two things may occur:

(i) with probability p, $0 < p < 1$, a new ball (with one pin attached) is inserted into $urn_1$,
or

(ii) with probability 1 - p an urn is selected, with $urn_i$ being selected with probability
proportional to $iF_i(k)$ (i.e. $urn_i$ is selected preferentially in proportion to the total
number of pins it contains), and a ball is chosen from the selected urn, $urn_i$ say; then,

(a) with probability q, $0 < q \le 1$, the chosen ball is transferred to $urn_{i+1}$ (this is
equivalent to attaching an additional pin to the ball chosen from $urn_i$), or

(b) with probability 1 - q the ball chosen is discarded.
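The two-step process above can be simulated directly. The following is a minimal Python sketch, assuming a dictionary of urn counts and the standard `random` module; the function name `simulate_urns`, its arguments and its return convention are illustrative choices and are not taken from the paper.

```python
import random


def simulate_urns(p, q, steps, seed=None):
    """Run the urn transfer process up to stage `steps`.

    Returns (counts, last_stage), where counts[i] is the number of balls
    in urn i and last_stage < steps signals that all urns became empty.
    """
    rng = random.Random(seed)
    counts = {1: 1}  # stage 1: a single ball, with one pin, in urn 1
    for stage in range(2, steps + 1):
        if not counts:  # every ball has been discarded: the process aborts
            return counts, stage - 1
        if rng.random() < p:
            # step (i): insert a new ball into urn 1
            counts[1] = counts.get(1, 0) + 1
            continue
        # step (ii): select urn i with probability proportional to i * F_i(k)
        urns = list(counts)
        weights = [i * counts[i] for i in urns]
        i = rng.choices(urns, weights=weights)[0]
        counts[i] -= 1
        if counts[i] == 0:
            del counts[i]
        if rng.random() < q:
            # step (ii)(a): move the chosen ball to urn i + 1
            counts[i + 1] = counts.get(i + 1, 0) + 1
        # step (ii)(b): otherwise the chosen ball is simply discarded
    return counts, steps


counts, last_stage = simulate_urns(p=0.3, q=0.975, steps=1000, seed=42)
```

Selecting an urn with weight $iF_i(k)$ and then any ball in it is the same as picking a ball with probability proportional to its number of pins. With q = 1 no ball is ever discarded, so the total number of pins at stage k equals k, which gives a simple sanity check on the sketch.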

The expected total number of balls in the urns at stage k is given by

\[ E\Big(\sum_{i=1}^{k} F_i(k)\Big) = 1 + (k-1)\big(p - (1-p)(1-q)\big) = (1-p)(2-q) + k\big(p - (1-p)(1-q)\big). \qquad (3) \]

We note that we could modify the initial conditions so that, for example, $urn_1$ initially
contained $\delta > 1$ balls instead of one. It can be shown, from the development of the model
below, that any change in the initial conditions will have no effect on the asymptotic distribution
of the balls in the urns as k tends to infinity, provided the process does not terminate
with all of the urns empty.

To ensure that, on average, more balls are added to the system than are discarded, on
account of (3) we require p > (1 - p)(1 - q), which implies

\[ q > \frac{1-2p}{1-p}; \]

this trivially holds for p > 1/2.

From now on we assume that this holds. This constraint implies that the probability that 
the urn transfer process will not terminate with all the urns being empty is positive. More 
specifically, the probability of non-termination is given by 

\[ 1 - \left(\frac{(1-p)(1-q)}{p}\right)^{\delta}, \qquad (4) \]

where $\delta$ is the initial number of balls in $urn_1$; this is exactly the probability that the gambler's
fortune will increase forever [Ros83].
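As a small worked illustration of the non-termination probability, the sketch below evaluates the gambler's-ruin expression $1 - ((1-p)(1-q)/p)^{\delta}$ in Python. The function name and the default of one initial ball are assumptions made for the example.

```python
def non_termination_probability(p, q, delta=1):
    """Probability that the urns never all become empty.

    The number of balls goes up by one with probability p and down by one
    with probability (1 - p) * (1 - q); `delta` is the initial number of balls.
    """
    down = (1.0 - p) * (1.0 - q)
    if p <= down:
        # balls are not added faster than they are discarded on average
        return 0.0
    return 1.0 - (down / p) ** delta


# Probability of an abort (all urns empty) for p = 0.3, q = 0.975, one ball
abort_probability = 1.0 - non_termination_probability(0.3, 0.975)
```

For p = 0.3 and q = 0.975 the ratio $(1-p)(1-q)/p$ is 0.0175/0.3, so the abort probability is about 5.8% when the process starts with a single ball.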

The total number of pins attached to balls in $urn_i$ at stage k is $iF_i(k)$, so the expected
total number of pins in the urns is given by

\[ E\Big(\sum_{i=1}^{k} iF_i(k)\Big) = 1 + (k-1)\big(p + (1-p)q\big) - (1-p)(1-q)\sum_{j=1}^{k-1}\theta_j \]
\[ = k\big(p + (1-p)q\big) - (1-p)(1-q)\Big(\sum_{j=1}^{k-1}\theta_j - 1\Big), \qquad (5) \]

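where $\theta_j$, $1 \le j \le k-1$, is the expectation of $\Theta_j$, the number of pins attached to the ball
chosen at step (ii) of stage j (i.e. the urn number). So

\[ \theta_j = E(\Theta_j) = E\left(\frac{\sum_{i=1}^{j} i^2 F_i(j)}{\sum_{i=1}^{j} iF_i(j)}\right). \qquad (6) \]

As a consequence we have

\[ 1 \le \theta_j \le j, \]

since at stage j there cannot be more than j pins in the system.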
Now let

\[ \theta^{(k)} = \frac{1}{k}\sum_{j=1}^{k}\theta_j. \]

Since there are at least as many pins in the system as there are balls, it follows from (3) and
(5) that

\[ 1 \le \theta^{(k)} \le \frac{1}{1-q}. \qquad (7) \]

So, since $\theta^{(k)}$ is bounded, we will make the reasonable assumption that $\theta^{(k)}$ tends to a
limit $\theta$ as k tends to infinity, i.e.

\[ \lim_{k \to \infty} \theta^{(k)} = \theta. \]

Letting k tend to infinity in (7) gives

\[ 1 \le \theta \le \frac{1}{1-q}. \]

In the next section we demonstrate through simulation of the stochastic process that our
assumption that $\theta^{(k)}$ converges appears to hold. We also explain how the asymptotic value $\theta$
may be obtained, assuming that the limit exists.

3 Derivation of the Steady State Distribution 

Following Simon [Sim55], we now state the mean-field equations for the urn transfer model.
For i > 1 we have

\[ E_k(F_i(k+1)) = F_i(k) + \beta_k\big(q(i-1)F_{i-1}(k) - iF_i(k)\big), \qquad (8) \]

where $E_k(F_i(k+1))$ is the expected value of $F_i(k+1)$ given the state of the model at stage
k, and

\[ \beta_k = \frac{1-p}{\sum_{i=1}^{k} iF_i(k)} \qquad (9) \]

is the normalising factor.



Equation (8) gives the expected number of balls in $urn_i$ at stage k + 1. This is equal to
the previous number of balls in $urn_i$ plus the probability of adding a ball to $urn_i$ minus the
probability of removing a ball from $urn_i$. The former probability is just the probability of
choosing a ball from $urn_{i-1}$ and transferring it to $urn_i$ in step (ii)(a) of the stochastic process
defined in Section 2, whilst the latter probability is the probability of choosing a ball from
$urn_i$ in step (ii) of the process.

In the boundary case, i = 1, we have

\[ E_k(F_1(k+1)) = F_1(k) + p - \beta_k F_1(k), \qquad (10) \]

for the expected number of balls in $urn_1$, which is equal to the previous number of balls in
the first urn plus the probability of inserting a new ball into $urn_1$ in step (i) of the stochastic
process defined in Section 2 minus the probability of choosing a ball from $urn_1$ in step (ii).

In order to solve the equations for the model, we make the assumption that, for large k,
the random variable $\beta_k$ can be approximated by a constant (i.e. non-random) value depending
only on k. We take this approximation to be

\[ \hat{\beta}_k = \frac{1-p}{k\big(p + (1-p)q - (1-p)(1-q)\theta^{(k)}\big)}. \]

The motivation for this approximation is that the denominator in the definition of $\beta_k$ has
been replaced by an asymptotic approximation of its expectation as given in (5). We observe
that replacing $\beta_k$ by $\hat{\beta}_k$ results in an approximation similar to that of the "$p_k$ model" in
[LFLW02], which is essentially a mean-field approach.

We can now take expectations of (8) and (10). Thus, by the linearity of the expectation
operator E(.), we obtain

\[ E(F_i(k+1)) = E(F_i(k)) + \hat{\beta}_k\big(q(i-1)E(F_{i-1}(k)) - iE(F_i(k))\big) \qquad (11) \]

and

\[ E(F_1(k+1)) = E(F_1(k)) + p - \hat{\beta}_k E(F_1(k)). \qquad (12) \]

In order to obtain an asymptotic solution of equations (11) and (12), we require that
$E(F_i(k))/k$ converges to $f_i$ as k tends to infinity. Suppose for the moment that this is the
case, then, provided the convergence is fast enough, $E(F_i(k+1)) - E(F_i(k))$ tends to $f_i$. By
"fast enough" we mean $\epsilon_{i,k+1} - \epsilon_{i,k}$ is o(1/k) for large k, where

\[ E(F_i(k)) = k(f_i + \epsilon_{i,k}). \]



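Now, letting

\[ \beta = \frac{1-p}{p + (1-p)q - (1-p)(1-q)\theta}, \qquad (13) \]

we see that $\hat{\beta}_k E(F_i(k))$ tends to $\beta f_i$ as k tends to infinity.
So, letting k tend to infinity, (11) and (12) yield

\[ f_i = \beta\big(q(i-1)f_{i-1} - if_i\big), \qquad (14) \]

for i > 1, and

\[ f_1 = p - \beta f_1. \qquad (15) \]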



Following the lines of the proof given in the Appendix of [LFLW02], we can show that $\epsilon_{i,k}$
tends to zero as k tends to infinity provided we make the further assumption that

\[ |\theta - \theta^{(k)}| \le \frac{c}{k}, \]

for some constant c. In other words, this assumption states that the expected number of pins
attached to the balls chosen in the first k stages of the stochastic process is within a constant
of the asymptotic expected number of pins attached to the chosen ball multiplied by k, i.e.

\[ k\theta - c \le \sum_{j=1}^{k}\theta_j \le k\theta + c. \]



In order to verify the convergence we ran some simulations; these will be discussed at the
end of this section.

Provided that $\beta_k$ can be approximated by $\hat{\beta}_k$ for large k, then, under the stated assumptions,
$f_i$ is the asymptotic expected rate of increase of the number of balls in $urn_i$; thus the
asymptotic proportion of balls in $urn_i$ is proportional to $f_i$.

From (14) and (15) we obtain

\[ f_i = \frac{\beta q(i-1)}{1 + i\beta}\, f_{i-1} = \frac{q(i-1)}{i+g}\, f_{i-1} \qquad (16) \]

and

\[ f_1 = \frac{p}{1+\beta} = \frac{gp}{1+g}, \qquad (17) \]

where $g = 1/\beta$. Now, on repeatedly using (16), we get

\[ f_i = \frac{g p\, q^{i-1}\; 1\cdot 2\cdots(i-1)}{(1+g)(2+g)\cdots(i+g)} = \frac{gp}{q}\,\frac{\Gamma(1+g)\,\Gamma(i)\, q^{i}}{\Gamma(i+1+g)}, \qquad (18) \]

where $\Gamma$ is the gamma function [AS72, 6.1].

Thus for large i, on using Stirling's approximation [AS72, 6.1.39], we obtain $f_i$ in a form
corresponding to (2):

\[ f_i \sim \frac{C q^{i}}{i^{1+g}}, \qquad (19) \]

where $\sim$ means is asymptotic to, and

\[ C = \frac{g p\,\Gamma(1+g)}{q}. \]
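The relationship between the recurrence, the gamma-function form and the asymptotic form can be checked numerically. The sketch below is illustrative only: it treats g as a free positive input (in the model g is tied to p and q), works with logarithms to avoid underflow of $q^i$, and uses function names that are not from the paper.

```python
import math


def f_recurrence(p, q, g, n):
    """Return [f_1, ..., f_n] from f_1 = g p / (1 + g) and
    f_i = q (i - 1) / (i + g) * f_{i-1}."""
    f = [g * p / (1.0 + g)]
    for i in range(2, n + 1):
        f.append(q * (i - 1) / (i + g) * f[-1])
    return f


def log_f_closed(p, q, g, i):
    """log f_i from the gamma-function form
    f_i = (g p / q) Gamma(1 + g) Gamma(i) q^i / Gamma(i + 1 + g)."""
    return (math.log(g * p / q) + math.lgamma(1.0 + g) + math.lgamma(i)
            + i * math.log(q) - math.lgamma(i + 1.0 + g))


def log_f_asymptotic(p, q, g, i):
    """log of C q^i / i^(1 + g) with C = g p Gamma(1 + g) / q."""
    log_c = math.log(g * p / q) + math.lgamma(1.0 + g)
    return log_c + i * math.log(q) - (1.0 + g) * math.log(i)


p, q, g = 0.3, 0.975, 1.1  # g = 1.1 is an arbitrary illustrative value
exact = f_recurrence(p, q, g, 50)
gap = log_f_closed(p, q, g, 2000) - log_f_asymptotic(p, q, g, 2000)
```

The recurrence and the closed form agree to rounding error, while `gap` shrinks roughly like 1/i, which is the sense in which the power law with cutoff is only an asymptotic description of $f_i$.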



From (18), it follows that

\[ \sum_{i=1}^{\infty} f_i = \frac{gp}{1+g}\sum_{j=0}^{\infty}\frac{\Gamma(2+g)}{\Gamma(j+2+g)}\,\frac{\Gamma(j+1)\,\Gamma(j+1)}{j!}\, q^{j} = \frac{gp}{1+g}\, F(1,1;2+g;q), \qquad (20) \]

where F is the hypergeometric function [AS72, 15.1.1]. From (20) it is immediate that the
first moment is given by

\[ \sum_{i=1}^{\infty} i f_i = \frac{gp}{1+g}\, F(1,2;2+g;q), \qquad (21) \]

and the second moment is given by

\[ \sum_{i=1}^{\infty} i^2 f_i = \frac{gp}{1+g}\, F(2,2;2+g;q). \qquad (22) \]



Under the assumptions we have made for the steady state distribution, using (6), (21)
and (22) we obtain

\[ \theta = \frac{F(2,2;2+g;q)}{F(1,2;2+g;q)}. \qquad (23) \]

In the special case when q = 1, which is Simon's original model, using the fact that in
this case g = 1/(1 - p), we obtain by Gauss's formula [AS72, 15.1.20]

\[ \sum_{i=1}^{\infty} i f_i = \frac{gp}{g-1} = 1, \]

as expected, and

\[ \sum_{i=1}^{\infty} i^2 f_i = \frac{g^2 p}{(g-1)(g-2)} = \frac{1}{2p-1}, \]

which is valid only if p > 0.5, i.e. g > 2.

Letting k tend to infinity in (5) and using (13) we obtain

\[ \sum_{i=1}^{\infty} i f_i = \frac{1-p}{\beta} = (1-p)g. \]

Together with (21), this gives the following equation for g in terms of p and q:

\[ (1-p)(1+g) = p\, F(1,2;2+g;q). \qquad (24) \]

This equation may be solved numerically to obtain the value of g, and $\theta$ can then be
obtained from (13) or (23). (It can be shown that, by virtue of (24), both equations yield the
same value for $\theta$.)
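One possible way to carry out this numerical solution is sketched below in Python. It is not the authors' procedure: the hypergeometric function is summed from its power series (adequate for 0 < q < 1), g is found by bisection on $(1-p)(1+g) - pF(1,2;2+g;q)$, and all function names are hypothetical. The sketch assumes p > (1 - p)(1 - q), so that the bracketing interval contains a root.

```python
def hyp2f1(a, b, c, z, tol=1e-15, max_terms=1_000_000):
    """Gauss hypergeometric series F(a, b; c; z), summed term by term for 0 <= z < 1."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def solve_g(p, q, iterations=200):
    """Bisection for g > 0 with (1 - p)(1 + g) = p F(1, 2; 2 + g; q)."""
    def h(g):
        return (1.0 - p) * (1.0 + g) - p * hyp2f1(1.0, 2.0, 2.0 + g, q)

    lo, hi = 1e-9, 1.0
    while h(hi) < 0.0:  # h increases with g, so expand until the sign changes
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def theta_from_moments(g, q):
    """theta as the ratio F(2, 2; 2 + g; q) / F(1, 2; 2 + g; q)."""
    return hyp2f1(2.0, 2.0, 2.0 + g, q) / hyp2f1(1.0, 2.0, 2.0 + g, q)


def theta_from_beta(p, q, g):
    """theta obtained by rearranging the definition of beta, with g = 1 / beta."""
    return (p + (1.0 - p) * q - (1.0 - p) * g) / ((1.0 - p) * (1.0 - q))


g = solve_g(0.3, 0.975)  # parameters of the simulations described in the text
theta_a = theta_from_moments(g, 0.975)
theta_b = theta_from_beta(0.3, 0.975, g)
```

The two values `theta_a` and `theta_b` should agree at the solution, which mirrors the remark that both routes give the same $\theta$. The series converges slowly when q is close to one, so the tolerance and term limit may need adjusting.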

In order to verify the convergence assumptions we ran several stochastic simulations. For 
p = 0.3 and q = 0.975, the number of aborts (i.e. computations terminating because all the 
urns were empty) predicted by (4) is about 5.8%. We ran the simulations 1000 times, each
run for 1000 steps. There were 49 aborts, with the maximum number of steps before aborting 
being 12 and the average 4.14. 

Figure 1 shows a summary of ten runs, each of half a million steps, with the above
parameters, plotting $\Theta_j$ against j. The bottom plot is the minimum $\Theta_j$ over the ten runs, the
middle plot is the average and the top plot is the maximum. The asymptotic average value
of $\Theta_j$ was 11.1902. On the other hand, computing $\theta$ from (23), averaged over ten runs, with
g taken to be $(k\beta_k)^{-1}$, gave 11.1762. As a final check, computing g from (24) and $\theta$ from (23)
or (13) gives 11.1753. In further simulations, similar plots were obtained for other values of
p and q.

As a further validation, we ran the stochastic simulation for 10,000 steps, repeated 100 
times, with parameters p = 0.3 and q = 0.975 as above, and compared it with a deterministic 
computation of the process using (8), (9) and (10) to calculate $E(F_i(k))$, the expected number
of balls in each urn, instead of $F_i(k)$, the actual number of balls. The expected values will
not in general be integral. The plots of the average $\Theta_j$ against j for the stochastic simulation
(solid line) and the deterministic computation (dashed line) are shown in Figure 2. The
asymptotic average value of $\Theta_j$ for the deterministic computation was 11.192 and for the
stochastic simulation 11.145.
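A deterministic computation of this kind might look like the following Python sketch, which iterates the mean-field updates on real-valued urn contents. It is an interpretation under stated assumptions, not the authors' code: the normalising factor is recomputed at each stage as (1 - p) divided by the current expected number of pins, and the quantity recorded per stage is the ratio of $\sum i^2 F_i$ to $\sum i F_i$, used here as a stand-in for the expected urn number of the chosen ball.

```python
def deterministic_urns(p, q, steps):
    """Iterate the mean-field equations from stage 1 to stage `steps`.

    Returns (F, ratios): F[i] is the expected number of balls in urn i
    (index 0 is unused) and ratios[j - 1] is sum(i^2 F_i) / sum(i F_i) at stage j.
    """
    F = [0.0, 1.0]  # stage 1: one ball in urn 1
    ratios = []
    for _ in range(steps - 1):
        pins = sum(i * F[i] for i in range(1, len(F)))
        sqpins = sum(i * i * F[i] for i in range(1, len(F)))
        ratios.append(sqpins / pins)
        beta = (1.0 - p) / pins  # normalising factor for this stage
        new = [0.0] * (len(F) + 1)  # one more urn can become occupied
        new[1] = F[1] + p - beta * F[1]
        for i in range(2, len(F) + 1):
            current = F[i] if i < len(F) else 0.0
            new[i] = current + beta * (q * (i - 1) * F[i - 1] - i * current)
        F = new
    return F, ratios


F, ratios = deterministic_urns(p=0.3, q=0.975, steps=300)
```

Summing the updates over all urns shows that the expected number of balls grows by exactly p - (1 - p)(1 - q) per stage, in agreement with (3), which gives a direct check on the iteration. Each stage costs time proportional to the number of urns, so long runs are quadratic in the number of steps.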

Figure 1: Convergence of $\theta_j$ to $\theta$

4 A Model for Protein Networks 

As mentioned in the introduction, exponential cutoff has been observed in several networks. 
Our model can be directly applied to web graphs [MBSA02], where balls represent web pages
and pins represent links, to e-mail networks [EMB02], where balls represent e-mail accounts
and pins represent e-mail messages, and to protein networks [JMBO01], where balls represent
proteins and pins represent protein interactions. In a web graph removing a ball corresponds 
to deleting a web page, in an e-mail network removing a ball corresponds to a user's e- 
mail account being removed from the network, and in a protein network removing a ball 
corresponds to gene loss resulting in the loss of a protein. The other category of network 
exhibiting exponential cutoff mentioned in the introduction, such as collaboration and actor 
networks, will be the subject of a follow-up paper. 

We note that some types of network, viz. protein, collaboration and actor networks, are 
essentially undirected. Consequently, in our model, a new interaction between two proteins, 
for example, ought to be represented by two separate events, corresponding to the new in- 
teractions for each of the two proteins. This would correspond to taking in pairs the events 
of attaching a pin to a ball. We ignore this complication, but note that many of the models 
proposed, for example for the web graph, similarly ignore the difference between directed and 
undirected graphs (e.g. [BA99]).

Figure 2: Comparison of the stochastic simulation and the deterministic computation

As a proof of concept, we focus here on protein networks, and in particular we will
examine a yeast protein interaction network [JMBO01]. This is an undirected graph that
can be downloaded from www.nd.edu/~networks/database/protein. To obtain the values
for g and q we performed a nonlinear regression on a log-log transformation of the degree
distribution of the yeast network data obtained from this website, fitting to the equation

\[ y = a - (g+1)x + \ln(q)\exp(x) \qquad (25) \]

implied by (19), where a is a constant. The values of g and q obtained from the regression of
the complete yeast data set are g = 1.065 and q = 0.9642.
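The fitting equation is nonlinear in x but linear in the unknowns a, g + 1 and ln(q), since exp(x) is just the degree i when x = ln(i). One way to fit it is therefore ordinary least squares on the regressors 1, ln(i) and i. The Python sketch below takes that route; this is an assumption for illustration and not necessarily the regression procedure used by the authors, and the data in the usage lines are synthetic, generated from known parameters rather than taken from the yeast network.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_cutoff_power_law(degrees, frequencies):
    """Least-squares fit of ln f = a - (g + 1) ln i + i ln q; returns (a, g, q)."""
    rows = [(1.0, math.log(i), float(i)) for i in degrees]
    ys = [math.log(f) for f in frequencies]
    # normal equations (A^T A) c = A^T y for the three coefficients
    ata = [[sum(r[u] * r[v] for r in rows) for v in range(3)] for u in range(3)]
    aty = [sum(r[u] * y for r, y in zip(rows, ys)) for u in range(3)]
    a, slope, log_q = solve_linear(ata, aty)
    return a, -slope - 1.0, math.exp(log_q)


# Synthetic check: data generated with a = 5, g = 0.56, q = 0.88
degrees = list(range(1, 16))
frequencies = [math.exp(5.0 - 1.56 * math.log(i) + i * math.log(0.88)) for i in degrees]
a_hat, g_hat, q_hat = fit_cutoff_power_law(degrees, frequencies)
```

Degrees with zero frequency cannot be log-transformed, so any gaps in the degree distribution have to be dropped before calling the fit.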

We next performed a stochastic simulation to test the validity of our model. In order to
carry out the simulation we require values of p and k. From (3) and (5), using the fact that

\[ (k\hat{\beta}_k)^{-1} \approx g, \]

we obtain

\[ \frac{balls_k}{k} \approx p - (1-p)(1-q) \qquad (26) \]

and

\[ \frac{pins_k}{k} \approx g(1-p), \qquad (27) \]

where $balls_k$ and $pins_k$ stand for the expected numbers of balls and pins in the urns at stage
k, respectively. (The right-hand sides of (26) and (27) are the limiting values of the left-hand
sides as k tends to infinity.)
From these we obtain an equation for the branching factor bf, viz.

\[ bf = \frac{pins_k}{balls_k} \approx \frac{g(1-p)}{p - (1-p)(1-q)}, \qquad (28) \]

and from (28) we can derive

\[ p \approx \frac{g + bf(1-q)}{g + bf(2-q)}. \qquad (29) \]

From the original yeast data, the values of $balls_k$ and $pins_k$ were 1870 and 4480, respectively,
which give a branching factor bf = 2.3957. Using the values of g and q from the
regression and this value of bf, from (29) we obtain a value of p = 0.3245 to use in the simulation.
Using this value of p, from (26) or (27) we obtain a value of k = 6227. (Alternatively,
we could have used (24) to estimate p, giving the value 0.3026. We preferred the previous
method, as this avoids the sensitivity of the hypergeometric function for values of q near 1.)
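The arithmetic in this step can be reproduced with a few lines of Python. The sketch assumes the relations $bf = pins_k/balls_k$, $p = (g + bf(1-q))/(g + bf(2-q))$ and $k = balls_k/(p - (1-p)(1-q))$; the function name is illustrative.

```python
def estimate_p_and_k(balls, pins, g, q):
    """Estimate the branching factor bf, the insertion probability p and the
    number of stages k from observed counts and fitted values of g and q."""
    bf = pins / balls
    p = (g + bf * (1.0 - q)) / (g + bf * (2.0 - q))
    k = balls / (p - (1.0 - p) * (1.0 - q))
    return bf, p, k


# Complete yeast data set: 1870 balls, 4480 pins, g = 1.065, q = 0.9642
bf, p, k = estimate_p_and_k(1870, 4480, 1.065, 0.9642)
```

With these inputs the sketch gives bf of about 2.3957, p of about 0.3245 and k of about 6227, matching the values quoted in the text. Because p is derived from bf, computing k from the pins relation $pins_k/(g(1-p))$ instead gives the same result.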

We carried out 10 simulation runs of the stochastic process using these values of p, q and
k. Using the value of $pins_k$ from the simulations we obtained an estimate of g from (27). The
average value for $balls_k$ was 1869, for $pins_k$ was 4795, and for g was 1.14.

Although these values seem to provide a good fit to the original data, as a further validation
we investigated the value of $\theta$. Its value can be estimated from (6) as

\[ \theta \approx \frac{sqpins_k}{pins_k}, \qquad (30) \]

where $sqpins_k$ is given by

\[ sqpins_k = \sum_{i=1}^{k} i^2 F_i(k). \]

The value of $sqpins_k$ from the original data is 29140. Using (30), this gives the empirical
value $\theta = 6.5045$.

We have three equations for $\theta$: the first is the approximation given by (30), the second is
(23), and the third, derived from (13), is

\[ \theta = \frac{p + (1-p)q - (1-p)g}{(1-p)(1-q)}, \qquad (31) \]

remembering that $g = 1/\beta$.
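Rearranging the definition of $\beta$ with $g = 1/\beta$ gives $\theta = (p + (1-p)q - (1-p)g)/((1-p)(1-q))$, and the first estimate is simply a ratio of two sums. The short Python sketch below evaluates both; the function names are illustrative, and the parameter values in the usage lines are ones quoted elsewhere in the text.

```python
def theta_from_counts(sqpins, pins):
    """First estimate: ratio of the sum of i^2 F_i to the sum of i F_i."""
    return sqpins / pins


def theta_from_g(p, q, g):
    """Third estimate: theta recovered from p, q and g = 1 / beta."""
    return (p + (1.0 - p) * q - (1.0 - p) * g) / ((1.0 - p) * (1.0 - q))


empirical = theta_from_counts(29140, 4480)        # about 6.5045
third = theta_from_g(0.2615, 0.879, 0.5586)       # about 5.574
# A change of 0.01 in g moves theta by 0.01 / (1 - q)
shift = theta_from_g(0.2615, 0.879, 0.5686) - third
```

Since $\theta$ depends on g through the factor $1/(1-q)$, small errors in the fitted g are amplified as q approaches one, which is one way to see why the estimates are sensitive to the regression.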

Using the value $sqpins_k = 40195$ from the simulation, the first estimate of $\theta$, from (30),
is 8.3823. The second estimate, from (23), is 9.3057, while the third estimate, from (31), is
10.0629. It can be seen that the empirical value of $\theta$ is not consistent with these estimates
from the mean-field equations.

We suggest that one reason for this inconsistency is the difficulty of fitting power-law
type distributions in the presence of non-monotonic fluctuations in the tail. (Another
reason may be the sensitivity of the nonlinear regression to the cutoff parameter q.) In particular,
the presence of gaps in the degree distribution is the main manifestation of this problem.
There is a gap in the degree distribution at i if there are no nodes of degree i but there exists
some node of degree j, where j > i. We discussed this problem more fully in the context
of a pure power-law distribution in [FLL05], and concluded that a preferable approach is to
ignore all data points from the first gap onwards. Evidence of the advantage of discarding
data points in the tail of the distribution was also given in [GMY04], who suggest the more
radical approach of using only the first five data points.

Following this approach we only use the first 15 data points in the degree distribution,
since the first gap occurs at i = 16. The values of g and q obtained from the nonlinear
regression fitting (25) to the truncated data set are g = 0.5586 and q = 0.879. The first
15 data points as well as the computed regression curve are shown in Figure 3. Using these
values of g and q together with the values of $balls_k$ and $pins_k$ from the original data, we
computed p = 0.2615 from (29) and k = 10860 from (26) or (27).

We then repeated the 10 stochastic simulations using the new values of p, q and k. The
average estimate for $balls_k$ was 1858, for $pins_k$ was 4463, and for g, using the values of $pins_k$
from the simulations and (27), was 0.5565. The first estimate of $\theta$, using (30) and the value
$sqpins_k = 24666$ from the simulation, was 5.5263. The second estimate was 5.5660 from
(23), while the third estimate was 5.5743 from (31). It can be seen that these estimates are
significantly more consistent with the empirical values of $sqpins_k$ and $\theta$ from the original data
set.

We then carried out a further verification of the applicability of our methodology. We 
computed the average number of balls in each urn over the second 10 simulation runs; the 
first urn that contained no balls in any of the simulations was urn 32. Next, we performed a 
nonlinear regression on the first 31 urn averages using (25), as before. The values of g and
q obtained from this regression were g = 0.5624 and q = 0.8880, which are very close to the 
corresponding values obtained from the regression on the truncated data set. 




Figure 3: Yeast protein interaction network data (horizontal axis: log of number of interactions)

Jeong et al. [JMBO01] fit the data set to a power-law distribution with an exponential
cutoff. However, they shift all the degree values by one, in order to obtain a better fit for small
degrees. They report q as approximately exp(-0.05) = 0.9512 and g as approximately 1.4
(see supplementary information provided by the authors of [JMBO01]). In order to compare
our results with theirs, we repeated the nonlinear regression on the first 15 data points, using
(25), taking the degrees of the data points to be from 2 to 16. This gave the values q = 0.9828
and g = 1.513, which are comparable to Jeong et al.'s results.

A confirmation of the existence of a cutoff can be obtained as follows. Let us assume there
is no cutoff, i.e. q = 1. Then, from (13) we have g = 1/(1 - p). Using this and (28) we derive
$bf \approx g/(g-1)$, and thus $g \approx bf/(bf-1)$. Now, since we are assuming there is no cutoff, we
can perform a linear regression on the log-log transformation of the first 15 data points,
i.e. fitting to (25) with q = 1, which yields g = 1.251.

Now, using the empirical branching factor bf = 2.3957 from the original data, we can 
compute g as 2.3957/1.3957 = 1.7165. This significantly deviates from the value 1.251, 
obtained from the linear regression. Alternatively, we can compute the branching factor as 
$bf \approx 1.251/0.251 = 4.9841$, which deviates significantly from the empirical branching factor.
These discrepancies lead us to conclude that a cutoff does exist, i.e. that q < 1. 
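The consistency check used in this argument is easy to reproduce. The Python sketch below assumes the no-cutoff relations bf = g/(g - 1) and g = bf/(bf - 1), which follow from setting q = 1; the function names are illustrative.

```python
def g_from_bf(bf):
    """Exponent parameter g implied by a branching factor bf when q = 1."""
    return bf / (bf - 1.0)


def bf_from_g(g):
    """Branching factor implied by g when q = 1 (requires g > 1)."""
    return g / (g - 1.0)


g_implied = g_from_bf(2.3957)   # about 1.7165, versus 1.251 from the regression
bf_implied = bf_from_g(1.251)   # about 4.9841, versus the empirical 2.3957
```

If there were no cutoff, the two routes would give matching values; the gap between 1.7165 and 1.251, or equivalently between 4.9841 and 2.3957, is the discrepancy the text relies on.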

5 Concluding Remarks 

We have presented an extension of Simon's classical stochastic process, which results in a 
power-law distribution with an exponential cutoff, and for which the power-law exponent 
need not exceed two. When viewing this stochastic process in terms of an urn transfer model, 
the difference from the classical process is that, after a ball is chosen on the basis of preferential 
attachment, with probability 1 - q the ball is discarded. Under our assumption that, for large
k, the normalising factor $\beta_k$ can be approximated by the constant $\hat{\beta}_k$, we have derived the
asymptotic formula (19), which shows that the distribution of the number of balls in the urns
approximately follows a power-law distribution with an exponential cutoff. We note that we
have, in fact, derived a more accurate formula (18) in terms of gamma functions.

Exponential cutoffs have been identified in protein, e-mail, actor and collaboration networks,
and possibly in the web graph [MBSA02]; it is likely exponential cutoffs also occur in
other complex networks. Our model assumes that balls are discarded rather than just becoming
inactive as in actor and collaboration networks (the treatment of such networks with
inactive nodes will be dealt with in a subsequent paper). We have validated our model with data
from a yeast protein network, showing that our model provides a possible explanation for the 
exponential cutoff. We have also presented convincing numerical evidence of the existence of 
a cutoff in this network. 

In addition, we have checked that our model is consistent with the emergence of the 
power-law distribution for inlinks in the web graph. However, in this case q is probably very 
close to one [MBSA02], and this may be the reason that we have not managed to establish
the existence of an exponential cutoff for the web graph. 